Exploratory Data Analysis (EDA) plays a key role in understanding a dataset. Whether you are building a machine learning model or simply extracting insights from the given data, EDA is the first task to perform. While its importance is undeniable, the effort required for EDA grows with the number of columns in your dataset.
This is a generic exploratory data analysis notebook which will serve as a guideline in your future data exploration endeavours. You can always build on this and add more analysis/graphs according to your dataset/requirements.
As prerequisites, we import the necessary libraries and load the files needed for our EDA.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
# Comment this out if the data visualisations don't work on your side
%matplotlib inline
plt.style.use('bmh')
Loading the dataset into a data frame and printing the top 5 rows to get a first look at the attributes
The training set is loaded into a data frame; the continuous column y serves as the class label
df = pd.read_csv(r'C:\Users\Saif\Desktop\ProHack Competition\train.csv', na_values=['NaN'])
df.head()
DataFrame.info() shows the number of rows and columns, the non-null count and data type of each column, the distinct data types present in the dataset with their counts, and the memory occupied by the data frame
df.info()
DataFrame.describe() shows summary statistics for the numerical columns in the data frame: count, mean, standard deviation, minimum, the quartiles and maximum for each variable
df.describe()
# transposing puts one variable per row, which is easier to read for wide datasets
df.describe().T
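As a quick illustration of what describe() returns, here is a minimal, self-contained sketch on a toy data frame (the column names are invented for the example):

```python
import pandas as pd

# toy data frame standing in for the real dataset
toy = pd.DataFrame({'area': [50, 60, 70, 80, 90],
                    'rooms': [1, 2, 2, 3, 4]})

summary = toy.describe().T  # transpose: one row per variable
print(summary)

# describe() reports count, mean, std, min, the quartiles and max
print(list(summary.columns))
# → ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
```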
DataFrame.describe() with include=['object'] summarises the categorical variables: count, number of unique values, the most frequent value (top) and its frequency (freq)
df.describe(include=['object']).T
# percentage of missing values per column (only columns that actually have missing values)
missing_pct = df.isnull().mean() * 100  # mean of the boolean mask = fraction of NaNs
columns_with_missing_values = missing_pct[missing_pct > 0].sort_values()
# plotting column name along with missing percentage
plt.figure(figsize=(18, 8))
plt.barh(columns_with_missing_values.index, columns_with_missing_values.values)
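The percentage computation can be sanity-checked on a tiny synthetic frame (the column names here are invented for the illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'full': [1, 2, 3, 4],
                    'half': [1, np.nan, 3, np.nan],
                    'empty': [np.nan] * 4})

# mean of the isnull() boolean mask gives the fraction of NaNs per column
missing_pct = toy.isnull().mean() * 100
missing_pct = missing_pct[missing_pct > 0].sort_values()
print(missing_pct)  # half → 50.0, empty → 100.0; 'full' is excluded
```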
Getting rid of features with more than 75% missing values; the ID column can also be removed here
'''
Pandas.count() does not include NaN values
'''
#keep columns where at least 25% of the values are non-null (i.e. drop columns with more than 75% missing)
df2 = df[[column for column in df if df[column].count() / len(df) >= 0.25]]
'''
Deleting ID column
'''
# del df2['Id']
'''
Printing list of dropped columns
'''
print("List of dropped columns:", end=" ")
for c in df.columns:
    if c not in df2.columns:
        print(c, end=", ")
print('\n')
df = df2
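The same filtering rule, applied to a toy frame, keeps any column whose non-null share meets the threshold and drops the rest (column names are invented):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'keep': [1, 2, 3, 4],
                    'mostly_nan': [1, np.nan, np.nan, np.nan]})  # 25% non-null: kept (>= 0.25)
toy['all_nan'] = np.nan                                          # 0% non-null: dropped

# count() excludes NaN, so count/len is the non-null fraction
filtered = toy[[c for c in toy if toy[c].count() / len(toy) >= 0.25]]
print(list(filtered.columns))  # → ['keep', 'mostly_nan']
```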
Distribution of class label
print(df['y'].describe())
plt.figure(figsize=(9, 8))
sns.histplot(df['y'], color='b', bins=50, kde=True, alpha=0.6);  # distplot is deprecated in recent seaborn
Distributions of other numerical variables
list(set(df.dtypes.tolist()))
df_num = df.select_dtypes(include = ['float64', 'int64'])
df_num.head()
df_num.hist(figsize=(60, 80), bins=50, xlabelsize=10, ylabelsize=10);
We compute the correlation of each numerical attribute with the class label y and display them in descending order of correlation
df_num_corr = df_num.corr()['y'].drop('y')  # drop y's correlation with itself
golden_features_list = df_num_corr.sort_values(ascending=False)
print(golden_features_list)
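To see why this ranking is useful, here is a sketch on synthetic data where one feature is built to track the target and another is pure noise (the feature names are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
y = rng.normal(size=500)
toy = pd.DataFrame({'strong': y * 2 + rng.normal(scale=0.1, size=500),  # nearly linear in y
                    'noise': rng.normal(size=500),                      # unrelated to y
                    'y': y})

corr_with_y = toy.corr()['y'].drop('y').sort_values(ascending=False)
print(corr_with_y)  # 'strong' ranks far above 'noise'
```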
Correlation Heat Map
corr = df_num.drop('y', axis=1).corr() # We already examined class label y correlations
plt.figure(figsize=(15, 15))
sns.heatmap(corr, vmax=1.0, vmin=-1.0, linewidths=0.1,
annot=True, annot_kws={"size": 4}, square=True);
Feature distributions with respect to class label
for i in range(0, len(df_num.columns), 3):
    sns.pairplot(data=df_num,
                 x_vars=df_num.columns[i:i+3],
                 y_vars=['y'], height=5)  # 'size' was renamed to 'height' in seaborn 0.9
Feature relationships with class label
features_to_analyse = list(df_num)
import math
fig, ax = plt.subplots(math.ceil(len(features_to_analyse) / 3), 3, figsize=(18, 72))
for i, ax in enumerate(fig.axes):
    if i < len(features_to_analyse) - 1:  # the last column of df_num is y itself, so skip it
        sns.regplot(x=features_to_analyse[i], y='y', data=df[features_to_analyse], ax=ax)
df_cat = df.select_dtypes(include=['O']).copy()  # .copy() avoids SettingWithCopyWarning when adding y
df_cat['y'] = df['y']
categorical_features = list(df_cat)
df_cat.head()
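select_dtypes(include=['O']) picks out only the object-dtype (string) columns; a minimal sketch with invented column names:

```python
import pandas as pd

# dtype=object is set explicitly so the string column is object-typed regardless of pandas version
toy = pd.DataFrame({'colour': pd.Series(['red', 'blue', 'red'], dtype=object),
                    'size': [1, 2, 3]})

cats = toy.select_dtypes(include=['O']).copy()
print(list(cats.columns))  # → ['colour']  (the numeric 'size' column is excluded)
```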
Box plots for categorical features against the class label
plt.figure(figsize = (10, 6))
ax = sns.boxplot(x='galaxy', y='y', data=df_cat)
plt.setp(ax.artists, alpha=.5, linewidth=2, edgecolor="k")
plt.xticks(rotation=45)
Most real-world datasets have more than one feature, and each feature can be considered a dimension in the space of data points. Consequently, more often than not, we deal with high-dimensional datasets, where visualizing the data in its entirety is quite hard.
To visualize the dataset as a whole, we need to decrease the number of dimensions used in visualization without losing much information about data. This task is called dimensionality reduction and is an example of an unsupervised learning problem because we need to derive new, low-dimensional features from the data itself, without any supervised input.
One of the well-known dimensionality reduction methods is Principal Component Analysis (PCA), which is covered in the previous lectures. Its limitation is that it is a linear algorithm that implies certain restrictions on the data.
There are also many non-linear methods, collectively called Manifold Learning. One of the best known of them is t-SNE.
The basic idea is to find a projection from the high-dimensional feature space onto a plane (or a 3D hyperplane, but it is almost always 2D) such that points that were far apart in the initial n-dimensional space end up far apart on the plane, while points that were originally close remain close to each other. Please refer to the t-SNE documentation for further details.
#scaling the dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_num.fillna(0))
#tsne modelling; fixing random_state makes the embedding reproducible
tsne = TSNE(n_components=2, random_state=42)
tsne_repr = tsne.fit_transform(X_scaled)
# converting into dataframe
tsne_repr = pd.DataFrame(tsne_repr, columns=['dim'+str(i) for i in range(1,3)])
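As a quick shape check, t-SNE always returns one low-dimensional point per input row; a toy sketch on random data (sizes chosen arbitrarily for the example):

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.RandomState(0).normal(size=(50, 10))  # 50 samples, 10 features

# perplexity must be smaller than the number of samples
emb = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)
print(emb.shape)  # → (50, 2)
```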
# converting the continuous target to class labels using binning (two equal-width bins)
df['y']=pd.cut(np.array(df['y']),2,labels=["low_well_being", "high_well_being"])
label_encoder=LabelEncoder()
df['y']=label_encoder.fit_transform(df['y'])
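One subtlety worth checking: LabelEncoder assigns integer codes in alphabetical order of the labels, so "high_well_being" becomes 0 and "low_well_being" becomes 1, regardless of the bin order passed to pd.cut. A sketch with toy values:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

values = np.array([1.0, 2.0, 9.0, 10.0])
binned = pd.cut(values, 2, labels=["low_well_being", "high_well_being"])
print(list(binned))  # first two values fall in the low bin, last two in the high bin

encoded = LabelEncoder().fit_transform(binned)
print(encoded)  # alphabetical coding: high_well_being → 0, low_well_being → 1
```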
# plotting the embedding with one colour per class label
plt.figure(figsize=(9, 9))
plt.scatter(np.array(tsne_repr['dim1']), np.array(tsne_repr['dim2']),alpha=0.4,s=70,c=df['y'].map({0: 'green', 1: 'red'}))
Finally, pandas-profiling can generate an interactive HTML report that presents, for each column, the statistics relevant to its type:
# import pandas_profiling
# pandas_profiling.ProfileReport(df)